David Nassimi and Sartaj Sahni


Abstract 

In  this  paper,  we  develop  an  algorithm  to  perform  BPC  permutations  on  a 
cube  connected  SIMD  computer.  The  class  of  BPC  permutations  includes 
many  of  the  frequently  occurring  permutations  such  as  matrix  transpose, 
vector  reversal,  bit  shuffle,  and  perfect  shuffle.  Our  algorithm  is 
shown to be optimal in the sense that it uses the fewest possible number
of unit routes to accomplish any BPC permutation.
1.  Introduction

An SIMD (Single Instruction stream, Multiple Data stream) computer is a
parallel  computer  consisting  of  a  large  number  of  identical  processing 
elements.  A  block  diagram  of  such  a  computer  is  given  in  Figure  1.  SIMD 
computers  have  the  following  characteristics: 

(1)  They consist of N processing elements (PEs). The PEs are
indexed 0, 1, ..., N-1 and an individual PE may be referenced
as in PE(i). Each PE is capable of performing the standard
arithmetic and logical operations. In addition, each PE knows
its index.

(2)  Each  PE  has  some  local  memory. 

(3)  The PEs are synchronized and operate under the control of a
single instruction stream. This instruction stream is generated
by the control unit, which has access to the program that is to be
executed.


(4)  An  enable/disable  mask  can  be  used  to  select  a  subset  of  the  PEs 
that  are  to  perform  an  instruction.  Only  the  enabled  PEs  will 
perform  the  instruction.  The  remaining  PEs  will  be  idle.  All 
enabled  PEs  execute  the  same  instruction.  The  set  of  enabled  PEs 
can  change  from  instruction  to  instruction. 

The  essential  feature  that  distinguishes  one  SIMD  computer  family  from 
another  is  the  interconnection  network.  In  this  paper,  we  are  concerned 
only  with  two  types  of  interconnection  networks:  the  mesh  and  the  cube. 

(i)  Mesh  Connected  Computer  (MCC) 

In this model, the PEs may be thought of as being logically
arranged as in a two dimensional array $A(n_1, n_2)$, where $N = n_1 \times n_2$.
The PE at location A(i,j) is directly connected to the PEs at
locations A(i ± 1, j) and A(i, j ± 1) (provided these PEs exist).
Figure 2(a) shows the interconnections in a 4 x 4 MCC.

(ii)  Cube  Connected  Computer  (CCC) 

Assume that the number of PEs, N, is a power of 2. So, $N = 2^q$.
Let $i_{q-1} \ldots i_0$ be the binary representation of $i$, $i \in [0, N-1]$,
and let $i^{(b)}$ denote the number whose binary representation is
$i_{q-1} \ldots i_{b+1} \bar{i}_b i_{b-1} \ldots i_0$, where $\bar{i}_b$ is the
complement of $i_b$ and $0 \le b < q$.
Hence, if $i$ has the binary representation 10110, then $i^{(2)}$ has the
representation 10010 and $i^{(0)}$ has the representation 10111. In a cube connected
computer, PE(i) is connected to PE($i^{(b)}$), $0 \le b < q$. Figure 2(b) shows the
PE interconnections in an 8 PE CCC.

Figure 1  Block diagram of an SIMD computer
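
The cube connection can be illustrated with a few lines of Python. This is only a sketch: the function name `cube_neighbors` is an assumption made for illustration, and the complement of bit b is computed as an exclusive-or with 2^b.

```python
def cube_neighbors(i, q):
    """Return the indices of the PEs directly linked to PE(i) in a CCC with 2**q PEs."""
    # Complementing bit b of i is the same as XOR-ing i with 2**b.
    return [i ^ (1 << b) for b in range(q)]


# The example from the text: complementing bit 2 and bit 0 of 10110.
print(format(0b10110 ^ (1 << 2), "05b"))
print(format(0b10110 ^ (1 << 0), "05b"))

# Neighbors of PE(5) in an 8 PE cube.
print(cube_neighbors(5, 3))
```

Each PE has exactly q = log N neighbors under this scheme, one per bit position.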

It is important to note that PEs can communicate only via the
interconnection network. Besides the mesh and cube connections, several other
connection schemes are possible. The reader is referred to Siegel
([SIEG79a] and [SIEG79b]) for a survey of interconnection networks for
SIMD computers. The largest SIMD computer currently under construction
is the massively parallel processor (MPP) being built by Goodyear Aerospace
Research [BATC80]. This machine uses the mesh interconnection scheme
(together with some variations) and will have 16,384 PEs.

An important problem that arises in SIMD computers is that of data
routing: moving data from one PE to another. While there are several forms
of the data routing problem [NASS81b], we shall deal only with the
permutation form. In a permutation problem, PE(i) wishes to send data to PE(A(i)),
$0 \le i < N$, where [A(0), ..., A(N-1)] is a permutation of [0, 1, ..., N-1].

Arbitrary data permutations are generally accomplished by sorting. For
certain classes of permutations, however, there exist algorithms that are
more efficient than sorting [NASS81a]. One such class is the BPC (bit
permute complement) class of permutations introduced in [NASS80]. A
permutation A is a BPC permutation iff it can be described by a vector
$B = [B_{q-1}, B_{q-2}, \ldots, B_0]$ (where $N = 2^q$ is the number of PEs), such that:

(a)  $B_i \in \{\pm 0, \pm 1, \ldots, \pm(q-1)\}$, $0 \le i < q$, and

(b)  $[|B_{q-1}|, |B_{q-2}|, \ldots, |B_0|]$ is a permutation of $[0, 1, \ldots, q-1]$.

The destination d of the data in PE(i) can be computed from this vector
B as follows. Let $i_{q-1} \ldots i_0$ be the binary representation of $i$ and let
$d_{q-1} \ldots d_0$ be the binary representation of $d$. For $j = 0, 1, \ldots, q-1$, we
have:

$$d_{|B_j|} = \begin{cases} i_j & \text{if } B_j \ge 0 \\ \bar{i}_j & \text{if } B_j < 0 \end{cases}$$

Note that we distinguish between +0 and -0 and that -0 < +0 = 0. Also,
note that the total number of permutations that can be specified in this way
is $2^q \, q! = N (\log N)!$.

Intuitively,  for  each  BPC  permutation  A  specified  by  B,  the  destination 
PE  for  PE(i)  is  obtained  by  permuting  the  bits  in  the  binary  representation 
of  i  and  complementing  certain  bits.  The  vector  B  specifies  how  the  bits  are 
to be permuted and also which bits are to be complemented. $|B_j|$ tells us
where bit j is to go, and the sign of $B_j$ tells us if the jth bit of i is to
get complemented.
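
The definition can be made concrete with a short Python sketch. The names `bpc_destination` and `all_bpc_vectors` are assumptions made for illustration. Because Python integers cannot distinguish +0 from -0, each entry of B is written here as a hypothetical (magnitude, complemented) pair, listed in the order $[B_{q-1}, \ldots, B_0]$.

```python
from itertools import permutations, product


def bpc_destination(i, B):
    """Destination of the data in PE(i) under the BPC vector B.

    B is listed as [B_{q-1}, ..., B_0]; each entry is a pair
    (magnitude, complemented) so that +0 and -0 stay distinct.
    """
    q = len(B)
    d = 0
    for j in range(q):
        magnitude, complemented = B[q - 1 - j]   # this entry is B_j
        bit = (i >> j) & 1                       # bit j of i
        if complemented:
            bit ^= 1
        d |= bit << magnitude                    # bit j goes to position |B_j|
    return d


def all_bpc_vectors(q):
    """Yield every BPC vector on q bits: a bit permutation plus a sign per bit."""
    for magnitudes in permutations(range(q)):
        for signs in product((False, True), repeat=q):
            yield list(zip(magnitudes, signs))


q = 3
tables = {tuple(bpc_destination(i, B) for i in range(2 ** q))
          for B in all_bpc_vectors(q)}
print(len(tables))   # number of distinct permutations specified for q = 3
```

For q = 3 the count should agree with $2^q \, q!$, and every generated table is a permutation of the PE indices.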

As an example, consider the case N = 16, q = 4 and B = [-0, 3, -1, -2].
The data from PE(i), $i = i_3 i_2 i_1 i_0$, is to be routed to PE(j), where
$j = i_2 \bar{i}_0 \bar{i}_1 \bar{i}_3$,
$0 \le i < N$. The BPC class of permutations includes most of the
permutations that commonly arise. Table 1 gives the B vectors corresponding
to several popular permutations.


Permutation           Vector Representation

Matrix Transpose      [q/2 - 1, ..., 0, q - 1, ..., q/2]
Bit Reversal          [0, 1, 2, ..., q - 1]
Vector Reversal       [-(q - 1), -(q - 2), ..., -0]
Perfect Shuffle       [0, q - 1, q - 2, ..., 1]
Unshuffle             [q - 2, q - 3, ..., 0, q - 1]
Shuffled Row Major    [q - 1, q/2 - 1, q - 2, q/2 - 2, ..., q/2, 0]
Bit Shuffle           [q - 1, q - 3, ..., 1, q - 2, q - 4, ..., 0]

Table 1  Some common permutations
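
Several rows of Table 1 can be checked against the familiar index formulas. The sketch below is illustrative only: the helper names (`bpc_destination`, `plus`, `vectors`, `direct`) are assumptions, entries are hypothetical (magnitude, complemented) pairs listed as $[B_{q-1}, \ldots, B_0]$, and matrix transpose is interpreted as transposing a $2^{q/2} \times 2^{q/2}$ matrix stored in row major order.

```python
def bpc_destination(i, B):
    """Apply d_{|B_j|} = i_j (complemented if B_j < 0); B is [B_{q-1}, ..., B_0]."""
    q = len(B)
    d = 0
    for j in range(q):
        magnitude, complemented = B[q - 1 - j]
        d |= (((i >> j) & 1) ^ int(complemented)) << magnitude
    return d


def plus(values):
    """Mark every entry of a vector as uncomplemented."""
    return [(v, False) for v in values]


q = 4
N = 2 ** q
h = q // 2

# B vectors built from the general forms in Table 1.
vectors = {
    "matrix transpose": plus(list(range(h - 1, -1, -1)) + list(range(q - 1, h - 1, -1))),
    "bit reversal": plus(range(q)),
    "vector reversal": [(v, True) for v in range(q - 1, -1, -1)],
    "perfect shuffle": plus([0] + list(range(q - 1, 0, -1))),
    "unshuffle": plus(list(range(q - 2, -1, -1)) + [q - 1]),
}

# The same permutations written directly on the index i.
direct = {
    "matrix transpose": lambda i: ((i % 2 ** h) << h) | (i >> h),
    "bit reversal": lambda i: int(format(i, f"0{q}b")[::-1], 2),
    "vector reversal": lambda i: N - 1 - i,
    "perfect shuffle": lambda i: ((i << 1) | (i >> (q - 1))) & (N - 1),
    "unshuffle": lambda i: (i >> 1) | ((i & 1) << (q - 1)),
}

for name, B in vectors.items():
    agrees = all(bpc_destination(i, B) == direct[name](i) for i in range(N))
    print(name, agrees)
```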


Nassimi and Sahni [NASS80] present an optimal algorithm for routing
BPC permutations on mesh connected computers. In [NASS81a], they show how
BPC permutations may be performed efficiently on a CCC. Their algorithm is,
however, suboptimal in the sense that for some BPC permutations more data
movement may take place than necessary. In this paper, we develop an optimal
algorithm for routing BPC permutations on a CCC.

2.  Optimal  BPC  Algorithm 

Let $b \in [0, q-1]$. In a unit-route (on a CCC), data can be moved from
PE(i) to PE($i^{(b)}$), $0 \le i < N$. Let $B = [B_{q-1}, \ldots, B_0]$ be the vector
representation of the BPC permutation A = [A(0), ..., A(N-1)]. We first obtain a
lower bound, $\delta(B)$, on the number of unit-routes needed to perform A on an
N PE CCC.

Theorem 1: Let $B = [B_{q-1}, \ldots, B_0]$ define the BPC permutation A = [A(0), ...,
A(N-1)]. $\delta(B)$ as given below is a lower bound on the number of unit-routes
needed to perform A on a CCC.

$$\delta(B) = |\{b \mid B_b \ne b\}|$$

Proof: For each b for which $B_b \ne b$, there exists at least one A(i) with
the property that $i_b \ne (A(i))_b$ ($(A(i))_b$ denotes bit b of A(i)). Thus, at
least one unit-route along bit b is needed. So, at least $\delta(B)$ unit-routes
are needed to perform A. □
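
The bound of Theorem 1 is easy to evaluate. In the sketch below, `delta` is an illustrative name, and each entry of B is again written as a hypothetical (magnitude, complemented) pair in the order $[B_{q-1}, \ldots, B_0]$ so that -0 counts as different from 0.

```python
def delta(B):
    """Number of bit positions b with B_b != b (a complemented entry never equals b)."""
    q = len(B)
    count = 0
    for b in range(q):
        magnitude, complemented = B[q - 1 - b]   # this entry is B_b
        if magnitude != b or complemented:
            count += 1
    return count


identity = [(2, False), (1, False), (0, False)]          # B = [2, 1, 0]
perfect_shuffle = [(0, False), (2, False), (1, False)]   # B = [0, 2, 1]
vector_reversal = [(2, True), (1, True), (0, True)]      # B = [-2, -1, -0]

print(delta(identity), delta(perfect_shuffle), delta(vector_reversal))
```

The identity needs no routes, while the perfect shuffle and vector reversal on 8 PEs each have a lower bound of three unit-routes.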

By making minor modifications to the routing algorithm presented in
[NASS81a], we can arrive at an algorithm that performs each BPC permutation
B using $2\delta(B) - 1$ unit-routes. The algorithm we are about to present will
use exactly $\delta(B)$ unit-routes and is therefore optimal.

Our algorithm follows the cycles present in the bit permutation B.
If $(k_1, k_2, \ldots, k_p)$ are the bits in a cycle of A then our algorithm first
routes all data to PEs having the correct final value for bit $k_1$ (i.e.,
following this route the destination D for the data in PE(i) is such that
$(D)_{k_1} = (i)_{k_1}$). Next, we route along $k_2$, then $k_3$, etc. Having finished
with this cycle, the next permutation cycle is followed and so on.

Let us first consider an example. Consider performing a perfect shuffle
on a CCC with 8 PEs. B = [0, 2, 1] and the destination A(i) for the data in
PE(i) has the binary representation $i_1 i_0 i_2$ (note that the binary
representation of i is $i_2 i_1 i_0$). The elements to be permuted are assumed to be in
register R of each PE. B has only one cycle, $(1, 2, 0) = (B_0, B_1, B_2)$. The
first route is along bit 1, then along bit 2, and finally along bit 0.
When routing along bit j, the route is done only for
those PEs containing data with destination A(i) such that $i_j \ne (A(i))_j$. In
our example, when routing along bit 1, we need to route only data from PE(i)
with $i_1 \ne i_0$. This is so because $(A(i))_1 = i_0$ and if $i_1 = i_0$ then the data
in PE(i) is already in a PE with the right bit 1. Data to be routed is
moved to a routing register S, and the route performed.
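
The first route of this example can be checked directly. The following sketch lists the PEs whose data must move along bit 1 for a perfect shuffle on 8 PEs; the names `bit`, `shuffle_dest`, and `movers` are assumptions chosen for illustration.

```python
q = 3
N = 2 ** q


def bit(x, b):
    """Bit b of the binary representation of x."""
    return (x >> b) & 1


def shuffle_dest(i):
    """Perfect shuffle destination: i_2 i_1 i_0 goes to i_1 i_0 i_2."""
    return ((i << 1) | (i >> (q - 1))) & (N - 1)


# Data leaves PE(i) on the first route only if bit 1 of i differs from
# bit 1 of its destination, which is i_0.
movers = [i for i in range(N) if bit(i, 1) != bit(shuffle_dest(i), 1)]
print([format(i, "03b") for i in movers])
```

Exactly half of the PEs send data on this route, namely those with $i_1 \ne i_0$.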

We  shall  use  the  following  notation  and  assumptions  in  specifying  our 
permutation  algorithm: 


Figure 3  Perfect Shuffle on a Cube

(1)  Each  PE  has  two  registers  R  and  S.  Both  these  registers  are  large 
enough  to  hold  the  data  being  routed.  R(i)  and  S(i)  refer  to  the 
corresponding  registers  in  PE(i). 

(2)  Three  types  of  assignments  will  be  used: 

(a)  := will be used for assignments requiring no routing. For example,
R(i) := S(i) (both R and S are in the same PE).

(b)  ↔ will be used for exchanges requiring no routing. R(i) ↔ S(i)
results in the R and S registers of PE(i) interchanging data.

(c)  ← will denote an assignment requiring a route. We shall require
that the PEs denoted by the left and right hand sides be connected
by a direct link in the PE interconnection pattern. For example,
R($i^{(b)}$) ← S(i) is valid for a CCC (recall that $i^{(b)}$ is obtained
from i by complementing bit b in the binary representation of i).
Each assignment of this type is a unit-route.

(3)  $i_b$ will denote bit b in the binary representation of i.

(4)  PE selectivity can be done using a mask. The mask is specified in
parentheses following the statement. Some examples of masks are:

(i)  ($i_b = 1$): this enables all PEs for which the binary
representation of the PE index has bit b equal to 1.

(ii)  ($i_j \ne i_k$): this enables all PEs for which the jth bit in the
binary representation of the PE index is different from the
kth bit.

When  no  mask  is  specified,  all  PEs  are  enabled.  Instructions  are 
executed  only  on  enabled  PEs. 

Procedure BPC_CUBE (Figure 4) is a formal statement of our BPC
permutation algorithm for CCCs. The loop of lines 1-16 searches for the beginning
of a bit cycle. If $|B_b| = b$, we have a cycle of length 1. When $B_b = b$, no
work needs to be done. When $B_b = -b$, it is necessary to complement along
bit b (line 4). If $|B_b| \ne b$, then we are at the start of a cycle of length
more than 1. Lines 7 to 12 follow this cycle. j is used to move along b,
$|B_b|$, $|B_{|B_b|}|$, etc. Line 9 puts into S(i) the data to be routed out of PE(i).
Line 10 carries out the route along bit k and then, in line 11, $B_j$ is set
to j to signify that the cycle containing j will have been taken care of by
the time we exit from the case statement. j is moved to the next point on
the cycle. Line 14 moves all valid data to the R registers.

We  need  to  elaborate  upon  two  of  the  statements  just  made  concerning 
the algorithm. First, we need to show that line 9 moves into the S registers
all records that are in a PE whose kth bit does not agree with the kth
bit  of  its  destination  PE.  Secondly,  we  need  to  show  that  line  14  correctly 
leaves  all  records  in  the  R  registers. 

Let TR(i) and TS(i), respectively, denote the source or originating PE
for the records currently in R(i) and S(i). Let AR(i) and AS(i) denote the
destination PEs for the records in R(i) and S(i), respectively. At the
start of each bit cycle (line 6) all records are in R(i). Let TS(i) = $ and
AS(i) = $. So, initially at line 8 we have (for any cycle): $(TR(i))_j = i_j$
and $(TS(i))_j \ne i_j$, $0 \le i < N$. (We shall assume that all relations involving
$ are true. So, $ = $i_j$ is true and $ ≠ $i_j$ is also always true.) The
relation holds since no routing on bit j could have been performed on any
previous iteration of the for loop of lines 1 to 16. If $B_j \ge 0$ then for
each PE(i) with $i_j \ne i_k$, we have:

$$(AR(i))_k = (TR(i))_j = i_j \ne i_k \quad\text{and}\quad (AS(i))_k = (TS(i))_j = \bar{i}_j = i_k$$

Hence, R(i) needs to be routed along bit k and S(i) doesn't. If $i_j = i_k$
then we have:

$$(AR(i))_k = (TR(i))_j = i_j = i_k \quad\text{and}\quad (AS(i))_k = (TS(i))_j = \bar{i}_j \ne i_k$$

Hence, S(i) needs to be routed along bit k and R(i) does not. So, line 9
correctly places into the S registers the records that need to be routed
along bit k when $B_j \ge 0$. Using a similar argument one can show the
correctness of line 9 when $B_j < 0$. So, following line 11, we have:

$$(AR(i))_k = (AS(i))_k = i_k, \quad 0 \le i < N.$$


procedure BPC_CUBE (B, q)
      // Permute R(0 : 2^q - 1) according to the BPC permutation B(0 : q-1) //
 1    for b := 0 to q-1 do
 2      case
 3        :B_b = b:     do nothing
 4        :B_b = -b:    R(i^(b)) ← R(i)
 5        :|B_b| ≠ b:
 6            j := b; s := B_b
 7            repeat
 8              k := |B_j|        // Next route is along dimension k //
                                  // Put outgoing elements in S //
 9              if B_j ≥ 0 then S(i) ↔ R(i), (i_j ≠ i_k)
                           else S(i) ↔ R(i), (i_j = i_k)
10              S(i^(k)) ← S(i)
11              B_j := j; j := k
12            until j = b
13            k := |s|            // s is the initial B_b //
14            if s ≥ 0 then R(i) := S(i), (i_b ≠ i_k)
                       else R(i) := S(i), (i_b = i_k)
15      end
16    end
end BPC_CUBE


Figure 4
Also, note that preceding the execution of line 10,

$$(TR(i))_k = (TS(i))_k = i_k$$

as no routes along bit k have yet been performed. Line 10 routes only S
values, so after line 10 we shall have:

$$(TR(i))_k = i_k \quad\text{and}\quad (TS(i))_k \ne i_k, \quad 0 \le i < N.$$

As a result, on all subsequent iterations of the loop of lines 7-12, we
shall have [$(TR(i))_j = i_j$ and $(TS(i))_j \ne i_j$], $0 \le i < N$, at line 8. So,
line 9 will correctly set S(i) and line 10 will route correctly.

The preceding argument shows that when the loop of lines 7-12 is
completed for any b then:

$$(AR(i))_r = (AS(i))_r = i_r$$

for all $r \in \{|B_b|, |B_{|B_b|}|, \ldots, b\}$.

It remains to move all records back into the R registers (line 14).
When a cycle is finished, half the records will be in the R registers and
half in the S registers. The records in the S registers need to be moved
to the R registers. The first time line 9 is executed for any cycle,
j = b and k = |s|. When line 10 is executed, records leave half the PEs
and the remaining half contain two records each. The empty PEs are those
with $i_b \ne i_k$ if $B_b \ge 0$ and $i_b = i_k$ if $B_b < 0$. Since bits b and $|B_b|$ do not
get used in line 10 again until the last iteration of the repeat-until loop,
these PEs remain empty. They get a record only after the last execution of
line 10 for this cycle. At this time k = b. Thus, the PEs containing
records in their S registers are those with index i such that $i_b \ne i_{|s|}$ if
$s \ge 0$ and $i_b = i_{|s|}$ if $s < 0$ (note that $s = B_b$). Hence, lines 6 to 14
correctly handle cycles of length more than 1 and leave all records in the
R registers. From this and lines 3 and 4, it follows that all cycles are
handled correctly and BPC_CUBE performs every BPC permutation. The time
complexity of BPC_CUBE is O(log N) and the number of unit-routes (lines
4 and 10) is $\delta(B)$. Hence, BPC_CUBE is optimal.
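
The behaviour argued above can be exercised with a sequential simulation. The sketch below is an illustration under stated assumptions, not the authors' implementation: the names `bpc_cube`, `destination`, and `check_all` are hypothetical, each B entry is a (magnitude, complemented) pair listed as $[B_{q-1}, \ldots, B_0]$, line 9 is modelled as a masked exchange of R and S, line 14 as a masked copy, and an empty register is modelled by `None`.

```python
from itertools import permutations, product


def bit(i, b):
    return (i >> b) & 1


def destination(i, B):
    """d_{|B_j|} = i_j, complemented when B_j < 0. B is [B_{q-1}, ..., B_0]."""
    q = len(B)
    d = 0
    for j in range(q):
        mag, neg = B[q - 1 - j]
        d |= (bit(i, j) ^ int(neg)) << mag
    return d


def bpc_cube(B_paper, records):
    """Simulate BPC_CUBE; return (final R registers, unit-routes used)."""
    q = len(B_paper)
    N = 2 ** q
    B = list(reversed(B_paper))          # B[j] = (|B_j|, B_j < 0)
    R = list(records)
    S = [None] * N                       # None models an empty register
    routes = 0
    for b in range(q):
        mag, neg = B[b]
        if mag == b and not neg:         # line 3: nothing to do
            continue
        if mag == b:                     # line 4: complement along bit b
            R = [R[i ^ (1 << b)] for i in range(N)]
            routes += 1
            continue
        j, s = b, B[b]                   # line 6
        while True:                      # lines 7-12
            k, neg_j = B[j]              # line 8
            for i in range(N):           # line 9: masked exchange of R and S
                if (bit(i, j) != bit(i, k)) != neg_j:
                    R[i], S[i] = S[i], R[i]
            S = [S[i ^ (1 << k)] for i in range(N)]   # line 10: one unit-route
            routes += 1
            B[j] = (j, False)            # line 11
            j = k
            if j == b:                   # line 12
                break
        k, neg_s = s                     # line 13
        for i in range(N):               # line 14: masked copy back into R
            if (bit(i, b) != bit(i, k)) != neg_s:
                R[i], S[i] = S[i], None  # clearing S is a simulation convenience
    return R, routes


def check_all(q):
    """Run every BPC vector on 2**q PEs; True if all records land correctly
    and the route count equals the number of b with B_b != b."""
    N = 2 ** q
    for mags in permutations(range(q)):
        for signs in product((False, True), repeat=q):
            B = list(zip(mags, signs))
            R, routes = bpc_cube(B, range(N))
            bound = sum(1 for b in range(q) if B[q - 1 - b] != (b, False))
            if routes != bound or any(R[destination(i, B)] != i for i in range(N)):
                return False
    return True


print(bpc_cube([(0, False), (2, False), (1, False)], range(8)))
print(check_all(3))
```

The first call runs the perfect shuffle example on 8 PEs; `check_all` repeats the check over every BPC vector for a small q, comparing the route count with the bound of Theorem 1.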




3.  Conclusions 

We have presented an optimal BPC routing algorithm for cube connected
computers. Several open problems remain. Is there a similarly optimal
algorithm to perform BPC permutations on perfect shuffle computers (see
[NASS81] for a description of the interconnection network used here)? Can
we develop optimal algorithms for other classes of permutations such as
omega and inverse omega permutations [LAWR75]?